Bayesian Identification of Cognates and Correspondences
Author: T. Mark Ellison
Abstract
Prague, June 2007. ©2007 Association for Computational Linguistics.
T. Mark Ellison, Linguistics, University of Western Australia, and Analith Ltd, mark markellison.net
This paper presents a Bayesian approach to comparing languages: identifying cognates and the regular correspondences that compose them. A simple model of language is extended to include these notions in an account of parent languages. An expression is developed for the posterior probability of child language forms given a parent language. Bayes' Theorem offers a schema for evaluating choices of cognates and correspondences to explain semantically matched data. An implementation optimising this value with gradient descent is shown to distinguish cognates from non-cognates in data from Polish and Russian.
Modern historical linguistics addresses questions like the following. How did language originate? What were historically-recorded languages like? How related are languages? What were the ancestors of modern languages like?
Recently, computation has become a key tool in addressing such questions. Kirby (2002) gives an overview of current work on how language evolved, much of it based on computational models and simulations. Ellison (1992) presents a linguistically motivated method for classifying segments as consonants or vowels. An unexpected result for the dead language Gothic provides added weight to one of two competing phonological interpretations of its orthography. Other recent work has applied computational methods for phylogenetics to measuring linguistic distances, and/or constructing taxonomic trees from distances between languages and dialects (Dyen et al., 1992; Ringe et al., 2002; Gray and Atkinson, 2003; McMahon and McMahon, 2003; Nakhleh et al., 2005; Ellison and Kirby, 2006).
A central focus of historical linguistics is the reconstruction of parent languages from the evidence of their descendants.
In historical linguistics proper, this is done by the comparative method (Jeffers and Lehiste, 1989; Hock, 1991) in which shared arbitrary structure is assumed to reflect common origin. At the phonological level, reconstruction identifies cognates and correspondences, and then constructs sound changes which explain them.
This paper presents a Bayesian approach to assessing cognates and correspondences. Best sets of cognates and correspondences can then be identified by gradient ascent on this evaluation measure. While the work is motivated by the eventual goal of offering software solutions to historical linguistics, it also hopes to show that Bayes' theorem applied to an explicit, simple model of language can lead to a principled and tractable method for identifying cognates.
The structure of the paper is as follows. The next section details the notions of historical linguistics needed for this paper. Section 2 formally defines a model of language and parent language. The subsequent section situates the work amongst similar work in the literature,
Similar Resources
Identification of Cognates and Recurrent Sound Correspondences in Word Lists
Identification of cognates and recurrent sound correspondences is a component of two principal tasks of historical linguistics: demonstrating the relatedness of languages, and reconstructing the histories of language families. We propose methods for detecting and quantifying three characteristics of cognates: recurrent sound correspondences, phonetic similarity, and semantic affinity. The ultim...
Identifying Complex Sound Correspondences in Bilingual Wordlists
The determination of recurrent sound correspondences between languages is crucial for the identification of cognates, which are often employed in statistical machine translation for sentence and word alignment. In this paper, an algorithm designed for extracting non-compositional compounds from bitexts is shown to be capable of determining complex sound correspondences in bilingual wordlists. I...
A Statistical Model for Lost Language Decipherment
In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences. This...
Determining Recurrent Sound Correspondences by Inducing Translation Models
I present a novel approach to the determination of recurrent sound correspondences in bilingual wordlists. The idea is to relate correspondences between sounds in wordlists to translational equivalences between words in bitexts (bilingual corpora). My method induces models of sound correspondence that are similar to models developed for statistical machine translation. The experiments show that...
An Algorithm For Identifying Cognates Between Related Languages
The algorithm takes as only input a list of words, preferably but not necessarily in phonemic transcription, in any two putatively related languages, and sorts it into decreasing order of probable cognation. The processing of a 250-item bilingual list takes about five seconds of CPU time on a DEC KL1091, and requires 56 pages of core memory. The algorithm is given no information whatsoever abou...
Creating a Comparative Dictionary of Totonac-Tepehua
We apply algorithms for the identification of cognates and recurrent sound correspondences proposed by Kondrak (2002) to the Totonac-Tepehua family of indigenous languages in Mexico. We show that by combining expert linguistic knowledge with computational analysis, it is possible to quickly identify a large number of cognate sets within the family. Our objective is to provide tools for rapid co...
Publication date: 2007